-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Multithreading for Enhanced Performance in Custom Check Processing #284
Conversation
pyQuARC/code/checker.py
Outdated
for field_dict in list_of_fields_to_apply: | ||
executor.submit( | ||
self._process_field, | ||
func, | ||
check, | ||
rule_id, | ||
metadata_content, | ||
field_dict, | ||
result_dict, | ||
rule_mapping, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't running this in parallel lead to missing values? Eg:
result_dict = {}
Running this method in parallel will pass result_dict
to the number of parallel method calls. when updating parallelly, wouldn't result_dict
be missing some elements?
Not sure if pass by value takes care of the issue. proper testing (unit and manual testing) is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did manual unit and integration tests with a curated list of concept ids, results obtained from pyquarc with and without using multithreading are exactly same.
pyQuARC/code/custom_checker.py
Outdated
with ThreadPoolExecutor() as executor: | ||
future_results = [] | ||
for arg in args: | ||
future = executor.submit( | ||
self._process_argument, | ||
arg, | ||
func, | ||
relation, | ||
external_data, | ||
external_relation, | ||
invalid_values, | ||
validity, | ||
) | ||
future_results.append(future) | ||
|
||
# Retrieve results from futures | ||
for future in future_results: | ||
invalid_values, validity = future.result() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't sub-threading be an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did manual testing with a curated list of concept ids, no issue at all.
Description:
This pull request introduces a multithreading solution to enhance the performance of custom check processing in the codebase. The existing codebase comprises various rules applied to multiple fields. During execution, it became evident that parallel processing is necessary at two levels: the field level and the argument level.
At the field level, a check needs to be performed for multiple fields simultaneously. Meanwhile, at the argument level, a check traverses through multiple arguments within a single field. Therefore, nested multithreading is required to fully improve the overall performance of the project.
By leveraging multithreading at both levels, we aim to parallelize the execution of checks, thereby significantly improving performance and efficiency, especially in scenarios involving a large number of checks or resolving URLs.
Changes:
process_argument
in custom_checker.py and_process_field
in checker.py to segregate the processes for parallel execution.Testing:
Extensive testing has been conducted to ensure the correctness and performance of the multithreading solution.
Integration tests have been performed to validate the code's functionality in various scenarios and edge cases.
Impact:
This change significantly improves the performance of custom check processing, especially in scenarios involving a large number of checks or resolving URLs.
Improved the efficiency of the execution on average from
~100s
to~10s